Topological Bias in Distance-Based Phylogenetic Methods: Problems with Over- and Underestimated Genetic Distances

I show several types of topological biases in distance-based methods that use the least-squares method to evaluate branch lengths and the minimum evolution (ME) or the Fitch-Margoliash (FM) criterion to choose the best tree. For a 6-species tree, there are two tree shapes, one with three cherries (a cherry is a pair of adjacent leaves descending from the most recent common ancestor), and the other with two. When genetic distances are underestimated, the 3-cherry tree shape is favored with either the ME or FM criterion. When the genetic distances are overestimated, the ME criterion favors the 2-cherry tree, but the direction of bias with the FM criterion depends on whether negative branches are allowed, i.e. allowing negative branches favors the 3-cherry tree shape but disallowing negative branches favors the 2-cherry tree shape. The extent of the bias is explored by computer simulation of sequence evolution.


Introduction
Topological bias of phylogenetic methods has been noted a long time ago as a potential source of bias in the study of speciation processes (Huelsenbeck and Kirkpatrick, 1996). Understanding the bias in different phylogenetic methods can help us explain previously unexpected observations in phylogenetic studies (Bruno and Halpern, 1999;Hillis, 1998;Huelsenbeck, 1998;Purvis and Agapow, 2002;Swofford et al. 2001). For example, a rooted tree with eight operational taxonomic units (OTUs) has many fewer possible perfectly symmetrical trees than maximally asymmetric trees. Consequently, a phylogenetic algorithm that picks up random trees, or even a good phylogenetic algorithm working on data that have already lost almost all the phylogenetic information such as extremely diverged sequences, would be more likely to end up with a maximally asymmetrical tree than a perfectly symmetrical tree (Huelsenbeck and Kirkpatrick, 1996).
Topological bias may confound the evaluation of the relative performance of phylogenetic methods in studies (e.g. Yang, 1997) that use a model tree to simulate sequence evolution and evaluate phylogenetic algorithms by checking which one is the most effi cient in recovering the model tree. For example, if a 4-OTU model tree has two sister OTUs with long branches, then phylogenetic algorithms, especially the maximum parsimony method, that suffers from the long-branch attraction problem will tend to be the most effi cient in recovering the known tree and may be misconstrued to be the best algorithm (Bruno and Halpern, 1999).
Here I report several kinds of topological biases in phylogenetic reconstruction by the distance-based methods based on the minimum evolution or Fitch-Margoliash criterion in selecting the best topology. The distance-based methods for phylogenetic reconstruction have several advantages over maximum parsimony and maximum likelihood methods. First, they are typically fast. Second, one can implement complicated substitution models such as those underlying the paralinear and Log-Det distances that would be diffi cult to implement in a maximum likelihood framework, although such attempt has been made recently (Jayaswal et al. 2005). Third, they appear to suffer less from the inconsistencies reported for maximum parsimony methods. Forth, they are better than maximum parsimony methods in estimating divergence time because of its model-based correction for multiple hits. For these and perhaps many other reasons, the distance-based methods have been used widely in molecular phylogenetics, especially with a large number of OTUs or in large-scale simulations (e.g. Xia et al. 2003b). In particular, the simplicity of the distance-based methods can often allow researchers to identify potential bias in reconstructed phylogenetic trees more readily than other methods (e.g. Xia et al. 2003a Topological bias associated with the leastsquares method and the minimum evolution (ME) criterion has previously been illustrated with four OTUs (operational taxonomic units) (Xia, 2006). Here I further explore the bias with six OTUs and with both the ME and the Fitch-Margoliash (FM) criterion. The reason for using more than four or fi ve species is because a 6-OTU tree has two tree shapes (Felsenstein, 2004, p. 33) whereas a 4-OTU or 5-OTU tree has only one tree shape and consequently is not useful to explore bias associated with tree shapes.
I will fi rst briefl y describe the statistical features common to frequently used distance-based methods and illustrate the systematic bias shared among these distance-based methods when the minimum evolution (ME) or Fitch-Margoliash (FM) criterion is used in choosing the best tree. This is followed by computer simulations to explore the extent of the bias.

Mechanistic Illustration of the Bias
Let us start with the two contrasting topologies A and B (Fig. 1). Designate D ij as the genetic distance between OTUs i and j. The least-squares estimate of x i for Topology A (Fig. 1a) is: The x i values for Topology B (Fig. 1b) are, To save space in writing equations, I will equate: According to the minimum evolution (ME) criterion, the best tree is the one with the shortest tree length (TL). Designating the tree length of Topology A and Topology B as TL a and TL b , respectively, we have: The tendency of favoring Topology B, measured as (TL a -TL b ), is which measures the tendency to choose Topology B over Topology A. In other words, we choose Topology B if T b.ME > 0 or Topology A if T b.ME < 0. When T b.ME = 0, the two trees are equally good based on the ME criterion. T b.ME is expected to be 0 in two situations: (1) when x 7 = 0 in Topology A and Topology B in Fig. 1 in which case the two topologies converge to Topology C in Fig. 1c, and (2) when sequences experienced full substitution saturation so that all D ij values are expected to be the same.
For the BME algorithm (Desper and Gascuel, 2002), the corresponding equation (derived by O. Gascuel, pers. comm.) is It is interesting to note that, although equation (6) differ in form from equation (5), T b.BME is expected to be 0 in two identical situations mentioned in the previous paragraph: (1) when x 7 = 0 in Topology A and Topology B in Fig. 1 and (2) when sequences experienced full substitution  Comparison between allowing (a) and disallowing (b) negative branch lengths, given that the true tree is Topology C in Fig. 1 with all branch lengths set to 0.1 and that D 56 is the only genetic distance estimated inaccurately. T b.ME , T b.FM0 and T bFM2 are explained in the text. saturation so that all D ij values are expected to be the same.
The FM criterion for choosing the best tree is based on where N OTU is the number of OTUs. It is important not to confuse the FM method for evaluating branch lengths with the FM criterion for choosing the best tree. The branch length of a particular topology can be evaluated by either the leastsquares method or the FM method, and the best tree can be chosen based on either the ME criterion which takes the shortest tree as the best tree or the FM criterion which takes the tree with smallest SS as the best tree.
In the FITCH program in PHYLIP, the default value for P in equation (7) is 2 but it can be 0 or any other value. Here I consider only two cases, with P = 0 (which means that the denominator will always be 1) and P = 2 (the default in PHYLIP's FITCH program).
When P = 0, the expressions for SS in equation (7) are relatively simple for Topology A and Topology B, and their difference can be written as where the subscript FM0 in T b.FM0 indicates the FM criterion with P = 0 and A, B, and C are defi ned in equation (3). The interpretation of T b.FM0 is the same as T b.ME in equation (7). That is, we choose When T b.FM0 = 0, the two trees are equally good based on the FM0 criterion. I will use T b.FM2 to represent (SS s -SS a ) for P = 2 (which is the default in PHYLIP's FITCH program). The algebraic expression for T b.FM2 is cumbersome to write down and only the computational results will be presented and compared with T b.ME and T b.FM0 . Consider fi rst Topology C in Fig. 1c with all branch lengths equal to 0.1. Note that Topology A and Topology B converges to Tree C when x 7 approaches 0. When there is no error in estimating D ij , then T b.ME , T b.FM0 and T b.FM2 (not shown) are all equal to 0 for Topology C, which is expected because Topology C is exactly intermediate between Topology A and Topology B. It is known that, if D ij values are estimated accurately, then the application of the ME criterion does not have bias favoring any particular topology (Bryant and Wadell, 1998;Rzhetsky and Nei, 1993).
I now consider a special case when D 56 is underestimated or overestimated, i.e. when the estimated D 56 is different from the true value of 0.2 (Fig. 1c). In reality, because of the shared branches between some distances, the error will not be limited to just one distance. However, examining this special case will shed light on various aspects of the topological biases arising from the application of the ME and the FM criteria.
Negative branch lengths allowed ME criterion: T b.ME in equation (5) increases linearly with the overestimation of D 56 (Fig. 2). When D 56 is underestimated, T b.ME is smaller than 0 and Topology A will be the ME tree although the true tree is Topology C. When D 56 is overestimated, T b.ME will be greater than 0 and Topology B will be the ME tree. The p distance, as well as other distances based on simple substitution models such as the JC69 model, tend to underestimate the true distance and will consequently tend to favor Topology A against Topology B. This Table 1. The effect of inaccurate estimation of D 56 (= x 8 + x 9 ) when the true tree is Topology C in Fig. 1 with all branch lengths equal to 0.1, except for x 8 and x 9 . The x 7 values are from the least-squares evaluation with Topology A. T b.ME , T b.FM0 and T bFM2 are explained in the text. suggests that some proposed guidelines favoring the use of simple distances (Nei, 1996;Nei and Kumar, 2000, pp. 112-113) may not be appropriate because they tend to favor Topology A against Topology B. In contrast, gamma-corrected distances, especially those assuming a large proportion of invariant sites, will tend to overestimate the true distance and will consequently favor Topology B against Topology A. This may be partially responsible for the poor performance of the HKY gamma distance with the ME criterion (Takahashi and Nei, 2000) because their simulated sequences are short, leading to large variances associated with D ij and frequent underestimation or overestimation of D ij . FM criterion: Topology A is always favored (Fig. 2a). This is easy to understand. When D 56 is underestimated, the least-squares method will fi nd a positive x 7 so that SS a in equation (7) is 0 (Table 1) and is always smaller than SS b . So T b.FM0 is negative and Topology A is the best. When D 56 is overestimated, the least-squares method will fi nd a negative x 7 so that SS a is again 0 (Table 1) and smaller than SS b . So again T b.FM0 is negative and Topology A is the best. This suggests that the topological bias associated with the overestimation of D 56 may be alleviated by disallowing negative branches. Disallowing negative branches is the default in PHYLIP's FITCH program and MEGA, as well as many others. Previous simulations have shown such treatment to signifi cantly improve the performance of distance-based methods (Kuhner and Felsenstein, 1994).

Negative branch lengths not allowed
With Topology A, when D 56 is underestimated, there is no negative branch length when branch lengths are evaluated by the least-squares method, so disallowing negative branch lengths has no effect on SS a . However, with Topology B, x 7 may become negative with an underestimated D 56 , so disallowing negative branch lengths will affect SS b . Similarly, when D 56 is overestimated, disallowing negative branch lengths will not affect SS b because all branch lengths for Topology B will be positive from the least-squares method. However, the overestimation of D 56 will leads to a negative x 7 for Topology A. So SS a will be affected when negative branch lengths are not allowed. I numerically illustrate below the effect of disallowing negative branch lengths.
The common treatment of negative branches is to set them to zero and re-estimated the length of other branches. Thus, when D 56 is underestimated leading to a negative x 7 in Topology B, we will set x 7 = 0 and re-estimate the lengths of other branches in Topology B. Similarly, when D 56 is overestimated leading to a negative x 7 in Topology A, we will again set x 7 = 0 and re-estimate the lengths of other branches in Topology A. Note that here we do not need the iterative method for estimating branch lengths with the weighted least-squares method (Felsenstein, 1997). The re-estimation results in two sets of formulae of T b.ME , T b.FM0, and T b.FM2 , with one set for underestimated D 56, and another for overestimated D 56 .
Underestimation of D 56 : The mathematical expressions of T b.ME and T b.FM0 are written below (but that for T b.FM2 is cumbersome and only its numerical result will be presented for comparison): There are three points worth highlighting. First, both T b.ME, D56≤0.2 and T b.FM0, D56≤0.2 approach zero when D 56 approaches the true value of 0.2, which is expected because the ME and the FM criteria are not biased when D ij values are accurately estimated (Bryant and Wadell, 1998;Rzhetsky and Nei, 1993). Second, T b.ME, D56≤0.2 is a linearly increasing function of the underestimated D 56 with a slope of 3/14 in equation (9), which is smaller than the slope of 5/18 in equation (5). This means that the bias in favor of Topology A with an underestimated D 56 is less serious when negative branch lengths are not allowed than when negative branch lengths are allowed. Third, T b.FM0 has a negative quadratic term such that T b.FM0 will approach zero asymptotically, instead of linearly as T b.ME , when D 56 approaches the true value of 0.2. In particular, when D 56 is underestimated, disallowing negative branch lengths does not help with the FM criterion as we can see from Fig. 2. The topological bias for D 56 < 0.2 is in fact greater when negative branches are not allowed (Fig. 2b) than when negative branches are allowed (Fig. 2a).
Overestimation of D 56 : The results are quite different when D 56 is overestimated. Now we have , .
There are again three points worth highlighting. First, T b.ME increases linearly (i.e. Topology B will be increasingly favored) with overestimation of D 56 . However, the slope ( = 4/63) is much smaller than that for underestimated D 56 shown in equation (9) where the slope is 3/14. This means that the topological bias associated with overestimating D 56 will not be as serious as underestimating D 56 when negative branch lengths are not allowed. This suggests that, when negative branch lengths are not allowed, overestimation of D ij is not as problematic as underestimation of D ij . Therefore, the advice of using overly simple distances that tend to underestimation D ij (Nei, 1996;Nei and Kumar, 2000, pp. 112-113) should be taken with caution. Second, for T b.FM0 , contrary to the negative quadratic term of D 56 in equation (8) where negative branch lengths are allowed, the quadratic term of D 56 in equation (10) is positive. This means that the application of the FM criterion will no longer favor Topology A against Topology B as it did before when negative branch lengths are allowed. Instead, the topological bias has changed direction when we change from allowing negative branch lengths to disallowing negative branch lengths.
The topological bias associated with the inaccurate estimation of D 56 for our special case, when negative branch lengths are not allowed, is illustrated in Fig. 2b. I highlight two points. First, when D ij are accurate, then the FM criterion is better than the ME criterion (e.g. when D 56 is within the range of 0.19-0.21 in Fig. 2). Second, when the error associated with D ij is suffi ciently large, then at least the FM2 criterion is worse than the ME criterion (e.g. when D 56 is < 0.17 or > 0.25 in Fig. 2). While previous simulations suggest that disallowing negative branches may signifi cantly improve the performance of distance-based methods (Kuhner and Felsenstein, 1994), our results show that the improvement may only be seen in cases where D ij values are estimated with great accuracy.
The various biases under different conditions are summarized in Table 2 to facilitate cross-reference.

Simulation and Discussion
Given the topological bias identifi ed in the previous section, I used simulation to further explore the extent of the bias in practical scenarios. I used the EVOLVER program in the PAML package (http:// abacus.gene.ucl.ac.uk/software/paml.html) to simulation sequence evolution with Tree A and Tree B in Fig. 3 which correspond to Topology A and Topology B (Fig. 1), respectively. Phylogenetic analysis was carried out by using DAMBE (Xia, 2001;Xia and Xie, 2001).
The substitution model used in the simulation is K80 (Kimura, 1980), with κ = 5 and equal nucleotide frequencies, and with no rate heterogeneity over sites. The sequence length was set to 1500 bases and the simulation of sequence evolution was repeated 500 times for each tree.
I used the JC69 and F84 distances to compute genetic distances. Because the sequences are generated with the K80 model, D JC will tend to underestimate D K80 when Q ≠ 2P (where P and Q are the proportion of sites with transitional and transversion substitutions, respectively) due to stochastic noise, and the resulting tree is expected to exhibit the bias associated with underestimation of D ij . In contrast, D F84 tends to overestimate D K80 whenever nucleotide frequencies are different from 0.25 and the resulting tree is expected to exhibit the bias associated with overestimation of D ij .

Simulation with Tree A
Underestimation of D 56 : For sequences generated with the two-parameter K80 model and Tree Table 2. Summary of topological biases showing which of the two topologies (A and B) is favored under different conditions. Strong bias is indicated by bold font.

Neg. branch (1) ME FM0 FM2
(1) Negative branches. A in Fig. 3, but D ij is estimated with D JC , D ij values are biased towards underestimation. In particular, D 56 tends to be most severely underestimated. According to Table 2, the underestimated D 56 should favor Topology A, regardless of which of the ME and FM criteria is used and whether negative branch lengths are allowed. Note that the bias here actually favors the recovery of the true tree. The six consensus trees from 500 sets of sequences ( Fig. 4) are consistent with the prediction. There is substantial difference in bootstrap values between the consensus tree in Fig. 4a (with negative branch length allowed), and that in Fig. 4b (with no negative branch length allowed), based on the ME criterion. This is expected from Table 2. With an underestimated D 56 , the ME criterion is more biased towards Topology A when negative branch lengths are allowed than when negative branch lengths are not allowed (Table 2). So we expect the bootstrap support for Topology A to be stronger when negative branch lengths are allowed (Fig. 4a) than when negative branch lengths are not allowed (Fig. 4b).
Overestimation of D 56 : For the same 500 sets of sequences generated with the K80 model and Tree A, but D ij estimated with the F84 model, D ij values (especially D 56 ) are biased towards overestimation. Table 2 suggests two predictions. First, the overestimated D 56 should favor Topology B with the ME criterion, regardless of whether negative branches are allowed or not. Second, with the FM criterion, Topology A is favored if negative branches are allowed, but Topology B is favored if negative branch lengths are not allowed. These two predictions are supported (Fig. 5). First, both consensus trees based on the ME criterion ( Fig. 5a and b, one with negative branches allowed and the other not) assume Topology B. Second, for the FM criterion, the consensus tree allowing negative branches ( Fig. 5c and e) exhibit Topology A, whereas the two consensus trees (Fig. 5d and f) not allowing negative branches have Topology B. For the consensus trees in Fig. 5d and Fig. 5f, the bootstrap values between OTUs 5 and 6 (265 and 267, respectively, in Fig. 5d and Fig. 5f) are mainly due to the fact that, among the 500 trees, OTU 5 is positioned closer to OTUs 1 and 2 about half of the time and to OTUs 3 and 4 another half of the time. The tree shape (Felsenstein, 2004, p. 33) is strongly supported.

Simulation with Tree B
Underestimated D 56 : For sequences generated with the two-parameter K80 model, but D ij is estimated with D JC , D ij values are biased towards underestimation. In particular, D 56 tends to be most severely underestimated. According to Table  2, the underestimated D 56 should favor Topology A even the model tree has Topology B, regardless of which of the ME and FM criteria is used and whether negative branch lengths are allowed. The consensus trees from 500 sets of sequences (Fig. 6) substantiate this prediction. It is remarkable that, while the model tree has Topology B, most reconstructed trees recovered Topology A (Fig. 6 a, c, d, f). This is a powerful demonstration of the topological bias. The bias associated with the ME criterion in favor of Topology A should be alleviated by not allowing negative branches (Table 2). This explains why the consensus tree allowing negative branches (Fig. 6a) assumes Topology A whereas the consensus tree not allowing negative branches (Fig. 6b) assumes Topology B. We also note that not allowing negative branches makes the FM2 criterion even more biased in favor of Topology A (Table 2). This explains why the consensus tree in Fig. 6f assumes Topology A whereas the consensus tree in Fig. 6e assumes Topology B.
Overestimated D 56 : For the same 500 sets of sequences generated with the K80 model and Tree B in Fig. 3, but D ij estimated with the F84 model, D ij values (especially D 56 ) are biased towards overestimation. According to Table 2, the overestimated D 56 should favor Topology B with the ME criterion. This is supported because both consensus trees ( Fig. 7a and b) exhibit Topology B.
For the FM criterion, the topological bias associated with the overestimation of D 56 will be in opposite direction depending on whether negative branch lengths are allowed or not. According to Table 2, the FM criterion will favor Topology A when negative branch lengths are allowed, but Topology B when negative branch lengths are not allowed. This prediction is clearly substantiated by the consensus trees obtained with the FM criterion, especially with the FM2 criterion ( Fig. 7e-f), where the consensus tree assumes Topology A when negative branch lengths are allowed but Topology B when negative branch lengths are not allowed.
I should finally discuss four issues closely related to our study. First, our result highlights the inadequacy of many studies evaluating relative performance of phylogenetic algorithms involving only four OTUs. The topological bias in the distance-based methods shown in our paper cannot be revealed unless one has at least six OTUs because, for unrooted trees with fewer OTUs, there is only one unrooted bifurcating tree shape.
Second, when a large number of OTUs are included in a phylogenetic study, it is highly likely that different subtrees may be governed by different substitution models and the genetic distance based on any one particular distance may overestimate some distances and underestimate others. This may cause problems in building very large trees with different subtrees suffering from different topological biases.
Third, the current paper is limited in at least two major ways. First, much of the analysis is based on the consequence of overestimation or underestimation of D 56 . Second, it does not provide a large-scale and realistic simulation to explore the consequence of the bias in practice. We are currently addressing these two problems.